Discriminability under Insensitivity: Joint Time-Frequency Scattering vs Time Scattering
This example contrasts the properties of Joint Time-Frequency Scattering (JTFS) and Wavelet Time Scattering (TS), describes some of their applications, and advises when one should be used over the other.
We show:
- Frequency-Dependent Time-Shift (FDTS) discriminability: JTFS is discriminative, while TS isn't.
- Frequency-shift insensitivity: TS is sensitive, while JTFS optionally isn't.
- Time-shift insensitivity: both are insensitive, and both can be invariant.
Introduction
Wavelet scattering is a deep convolutional network formed by cascade of wavelets, complex modulus nonlinearities, and lowpass filters. Unlike a neural network, it uses fixed weights, making it more interpretable and controllable. It yields representations that are time-shift insensitive, robust to noise, and stable against time-warping deformations, with uses in classification, synthesis, and other tasks.
Joint Time-Frequency Scattering extends wavelet time scattering by exploiting time-frequency structure, adding sensitivity to frequency-dependent time shifts and optional insensitivity to frequency transposition, while inheriting time scattering's aforementioned properties.
We contrast these transforms via illustrative examples, and supply real-world context. This example borrows from [1]. FDTS discriminability
A frequency-dependent time-shift (FDTS) is a shifting in time of frequency slices of a signal, where the shifting depends on frequency. One formulation is:
where X is the Continuous Wavelet Transform (CWT), but other time-frequency representations can also be used.
To illustrate, we take windowed pure sines, and shift them more with greater frequency. We use gen_fdts, controlled as follows:
- n_partials: total number of partials, or windowed sines.
- partials_f_sep: f_next = f_prev * partials_f_sep. A constant shift in log-frequency is a multiplication.
- seg_len: partials' effective temporal support.
- f0: initial partial's frequency (which is also the lowest).
- total_shift: total time shift, in number of samples.
The specific choice of parameters is tailored for illustrative purposes.
N = 4096; % signal length
% /10 and /8 are fine also, /9 was chosen for prettier time waveform
total_shift = floor(N/9);
[x, xs] = gen_fdts(N, n_partials, partials_f_sep, seg_len, f0, total_shift);
Next, we configure the scattering objects. Of relevance are the following parameters:
- T: should be set to signal's duration, i.e. global averaging, to maximize TS's insensitivity to FDTS.
- Q & r_psi: chosen to yield good time-frequency resolution tradeoff for the example, and sufficient discretization (see r_psi docs) for faithful displayed plot quality.
- pad_mode: zero-padding is best suited for shift-related illustrations. (E.g. 'reflect' reflects the shifts in xs, rather than shifting x's reflections.)
- analytic: should be false to prioritize effective temporal compactness of wavelets; the accurate A.M. mapping achieved by true isn't relevant here.
- max_pad_factor: should be unrestricted for best results.
- precision: should be double, as these experiments are highly numerically sensitive.
And, specific to JTFS:
- out_exclude: do only spinned coefficients, as we focus on the FDTS-capturing part of JTFS. We can also do this in practice, by emphasizing spinned coefficients or excluding others.
- sampling_filters_fr: compute all spinned coefficients, without exclusions, for completeness.
Parameters which weren't described are specified for backwards compatibility.
% JTFS-specific parameters
sampling_filters_fr = 'resample';
out_exclude = {'S0', 'S1', 'psi_t * phi_f', 'phi_t * psi_f', 'phi_t * phi_f'};
% in case of future changes
T={T}, Q={Q}, r_psi={r_psi}, pad_mode={pad_mode}, ...
max_pad_factor={max_pad_factor}, analytic={analytic}, precision={precision});
cfg_jtfs = dictionary( ...
Q_fr={Q_fr}, sampling_filters_fr={sampling_filters_fr}, ...
max_pad_factor_fr={max_pad_factor_fr}, out_3D={out_3D}, ...
average_fr={average_fr}, out_exclude={out_exclude});
% `update_dictionary` joins the entries of `cfg_ts` and `cfg_jtfs`
cfg_jtfs = update_dictionary(cfg_ts, cfg_jtfs);
Make scattering objects
ts = Scattering1D(N, kwargs=cfg_ts, out_type=out_type);
jtfs = TimeFrequencyScattering1D(N, kwargs=cfg_jtfs, out_type=out_type);
Make "CWT" by having Scattering1D only return first-order unaveraged coefficients
% oversampling: high to disable subsampling and enable 2D concatenation
% out_type: 'list' since average=false doesn't permit 'array'
cwt = Scattering1D(N, kwargs=cfg_ts, ...
average=false, out_type='list', oversampling=99, max_order=1);
Apply the transforms
Pack CWT into array, and exclude S0
% `2:end` to exclude `S0`
cwt_x = cellfun(@(c)c{'coef'}, cwt_x(2:end), 'UniformOutput', false);
cwt_x = cat(1, cwt_x{:});
cwt_xs = cellfun(@(c)c{'coef'}, cwt_xs(2:end), 'UniformOutput', false);
cwt_xs = cat(1, cwt_xs{:});
Compute and display distances
% relative L2, `||x0 - x1|| / ||x0||`
rl2_ts = rel_l2(ts_x, ts_xs);
rl2_jtfs = rel_l2(jtfs_x, jtfs_xs);
'FDTS sensitivity (relative L2)\n' ...
], rl2_jtfs, rl2_ts, rl2_jtfs / rl2_ts)
FDTS sensitivity (relative L2)
JTFS = 0.33
TS = 4.15e-04
JTFS/TS = 803.0
Make and save GIF
scriptDir = fileparts(matlab.desktop.editor.getActive().Filename);
make_gif_fdts(saveDir, x, xs, ts_x, ts_xs, jtfs_x, jtfs_xs, cwt_x, cwt_xs)
Result:
- Left: signal, zoomed
- Right: (modulus of) CWT of signal
- Bottom-top: TS of signal, using T = signal duration
- Bottom-bottom: JTFS of signal (spinned coefficients only), using same T
TS doesn't visibly change at all, while a fair portion of JTFS does. The relative distances,
are (repeated for convenience)
Concerning the ratio:
- Lowering T lowers the ratio, as TS becomes more sensitive.
- Including all JTFS coefficients (i.e. non-spinned also) lowers the ratio, but the ratio stays high. That's because, by design, only spinned coefficients are FDTS-sensitive. In practice, we can choose to emphasize spinned coefficients, or exclude non-spinned.
Frequency shift insensitivity
A "frequency shift", in context of scattering, refers to a log-frequency shift. Like a horizontal shift in CWT is a time shift, a vertical shift in CWT is a log-frequency shift.
Illustrating, we take windowed sines, and shift them in log-frequency. We reuse gen_fdts, applying it twice with different initial frequency f0, and discard the shifted output:
[x0, ~] = gen_fdts(N, n_partials, partials_f_sep, seg_len, f0_0);
[x1, ~] = gen_fdts(N, n_partials, partials_f_sep, seg_len, f0_1);
Next, the scattering objects. Here we make CWT for visualization, and TS and JTFS for scattering. Parameter notes:
- F: we make multiple JTFS, each with its own F. For "F=0", we set F close but not equal to 0, since the library requires F>0. For "F='global'", refer to the explanation in the makePhiFrExactlyGlobalAveraging() helper.
The other parameter choices' rationale follows that of the preceding section.
% JTFS-specific parameters
out_exclude = {'S0', 'S1'};
ckw = dictionary(Q={Q}, r_psi={r_psi}, analytic={analytic}, ...
jkw = dictionary(average_fr={average_fr}, oversampling_fr={oversampling_fr}, ...
max_pad_factor_fr={max_pad_factor_fr}, ...
pad_mode_fr={pad_mode_fr}, out_exclude={out_exclude});
jkw = update_dictionary(jkw, ckw);
% make scattering/CWT objects
ts = Scattering1D(N, kwargs=ckw);
jtfs0 = TimeFrequencyScattering1D(N, kwargs=jkw, F=1e-16);
jtfs1 = TimeFrequencyScattering1D(N, kwargs=jkw, F=8);
jtfs2 = TimeFrequencyScattering1D(N, kwargs=jkw, F=32);
jtfs3 = TimeFrequencyScattering1D(N, kwargs=jkw, F=127.9);
cwt = Scattering1D(N, kwargs=ckw, average=false, out_type='list', ...
oversampling=99, max_order=1);
Turn the frequential lowpass filters into constants; refer to the helper method's docstring for an explanation.
makePhiFrExactlyGlobalAveraging(jtfs3)
Apply transforms
% pack CWT into array, and exclude `S0`
cwt_x0 = cellfun(@(c)c{'coef'}, cwt_x0(2:end), 'UniformOutput', false);
cwt_x1 = cellfun(@(c)c{'coef'}, cwt_x1(2:end), 'UniformOutput', false);
cwt_x0 = cat(1, cwt_x0{:});
cwt_x1 = cat(1, cwt_x1{:});
For accurate visualization of shift distances, second-order paths should be ordered n2-first (n2 being the index of a second-order "path"), while the library computes it n1-first. We reorder them here
[ts_x0, ts_x1] = reorder_S2(ts, ts_x0, ts_x1);
Compute and display distances
% relative L2, `||x0 - x1|| / ||x0||`
rl2_ts = rel_l2(ts_x0, ts_x1);
rl2_jtfs0 = rel_l2(jtfs0_x0, jtfs0_x1);
rl2_jtfs1 = rel_l2(jtfs1_x0, jtfs1_x1);
rl2_jtfs2 = rel_l2(jtfs2_x0, jtfs2_x1);
rl2_jtfs3 = rel_l2(jtfs3_x0, jtfs3_x1);
'Frequency shift sensitivity, JTFS/TS:\n' ...
'%.4f -- F=''global''\n'], ...
rl2_jtfs0 / rl2_ts, rl2_jtfs1 / rl2_ts, ...
rl2_jtfs2 / rl2_ts, rl2_jtfs3 / rl2_ts)
Frequency shift sensitivity, JTFS/TS:
0.9321 -- F=0
0.5681 -- F=8
0.2142 -- F=32
0.0079 -- F='global'
Make and save GIF
make_gif_freq_tp(saveDir, x0, x1, cwt_x0, cwt_x1, ts_x0, ts_x1, ...
jtfs0_x0, jtfs0_x1, jtfs1_x0, jtfs1_x1, ...
jtfs2_x0, jtfs2_x1, jtfs3_x0, jtfs3_x1)
Result:
- Top plots: left is signal, right is its scalogram.
- Bottom plots: top is TS, below it are JTFS with varying F.
Shift-insensitivity is applied analogously to time-shifts, with the following distinctions:
- Unlike time-shifts, a shift in log-frequency is inexact. That is, the effect on the transform, in case of CWT, isn't actually
; in short, a frequency shift in CWT causes extension in time, since the wavelets are multi-resolution. - For linear frequency shifts, i.e. shifts of form
, where
is the Fourier transform of x, there's still imposed insensitivity, but its extent is controlled in a (even more) non-straightforward manner.
The JTFS-to-TS relative distance ratios are (repeated for convenience)
For:
- F=0, i.e. no frequential averaging, the ratio is lower than 1 due to frequential scattering, which smoothes along frequency.
- F=8, the ratio is significantly lower, and for F=32 it's yet lower, but both are considerable for the shift in our example.
- F='global', the ratio is very low, making JTFS practically shift-invariant. The ratio not being zero is consistent with the earlier note on CWT, but for most practical purposes, the difference is negligible.
Example: exponential chirp
We further illustrate JTFS's FDTS-discriminability by inspecting its "spin assymetry" in response to an exponential chirp (echirp).
An echirp is the "ideal" signal, as it forms a (approx.) straight line in log-frequency. It can be formed (approx.) by shifting a delta (unit impulse), by a constant in time per a constant shift in log-frequency. Despite being remarkably different, time scattering cannot distinguish them [3] (1). We make an echirp, and show its scalogram. First, configure scattering. For illustration, we show the pre-averaged coefficients. Except for the following, parameters follow same logic as in other sections, or are simply tailored for visualizing our specific signal:
- average, average_fr: both false, as per our objective.
- oversampling, oversampling_fr: set to maximum to disable subsampling, as we require equal dimensionalities for visualization.
- T, F: still apply to non-spinned pairs, so we match them against J and J_fr (consistent with standard idealized configuration (2)).
- paths_exclude: for a more compact visualization, keep number of coefficients limited.
- max_noncqt_fr: our "CWT" is a mix of CWT and approximate STFT; setting max_noncqt_fr=0 disables including any non-CWT coefficients into joint pairs, which aids with example's idealization.
paths_exclude = dictionary(n1_fr={1}, j2={[1 2 3]});
% JTFS-specific parameters
sampling_filters_fr = 'resample';
Q={Q}, J={J}, T={T}, average={average}, oversampling={oversampling}, ...
max_pad_factor={max_pad_factor}, pad_mode={pad_mode}, ...
paths_exclude={paths_exclude}, out_type={out_type}, ...
J_fr={J_fr}, Q_fr={Q_fr}, F={F}, average_fr={average_fr}, ...
out_3D={out_3D}, oversampling_fr={oversampling_fr}, ...
max_pad_factor_fr={max_pad_factor_fr}, pad_mode_fr={pad_mode_fr}, ...
sampling_filters_fr={sampling_filters_fr}, max_noncqt_fr={max_noncqt_fr} ...
jtfs = TimeFrequencyScattering1D(N, kwargs=jkw);
Generate the echirp. Span frequencies from low to Nyquist. fmin shouldn't be too low as it exacerbates low-frequency time warping and deviates from the straight-line idealization.
x = wavemat.toolkit.echirp(N, fmin=64, fmax=N/2);
Take JTFS
Compute and display energy ratio
up = cellfun(@(c)c{'coef'}, Scx{'psi_t * psi_f_up'}, 'UniformOutput', false);
dn = cellfun(@(c)c{'coef'}, Scx{'psi_t * psi_f_dn'}, 'UniformOutput', false);
E_up = sum(up.^2, 'all');
E_dn = sum(dn.^2, 'all');
fprintf('E_down / E_up = %.3g\n', E_dn / E_up)
Make and save signal and scalogram plot. We fetch the scalogram from S1, since average=false.
make_scalogram(x, cwt_x, saveDir)
Make and save the 2D GIF visual.
make_gif_echirp2d(jtfs, Scx, saveDir)
Next, rebuild JTFS with time averaging and no excluded paths, and re-scatter. Use average=true so that our GIF has a manageable number of frames, and all paths so each 3D frame is richer.
jkw('oversampling') = [];
jkw('paths_exclude') = [];
jtfs = TimeFrequencyScattering1D(N, kwargs=jkw);
Make and save the GIF
baseName = 'echirp_jtfs3d';
wavemat.visuals.gif_jtfs_3d( ...
Scx_full, jtfs=jtfs, preset=preset, base_name=baseName, savedir=saveDir, ...
cmap_norm=.5, overwrite=1, verbose=0)
Result:
Each JTFS joint coefficient is output from a 2D convolution (rather, cross-correlation) with a 2D wavelet. We show the wavelets, and their corresponding coefficients:
- Vertical axis: xi1_fr, log-quefrency, in cycles/octave (scattering along log-frequency).
- Horizontal axis: xi2, log-frequency, in cycles/sample (second-order scattering along time).
- To convert xi2 to Hertz: (1) multiply by sampling rate, fs (e.g. if fs=1000 Hz, then xi2=0.0125 corresponds to 1000*0.0125=12.5 Hz); (2) pass fs to viz_jtfs_2d in make_gif_echirp2d.
Positive xi1_fr is "spin up", which is "phase contour down" and corresponds to downward-sloped geometry in time-frequency. Negative xi1_fr is "spin down", opposite to up. Coefficients with xi1_fr=0 or xi2=0 are complementary "unspinned" pairs.
We observe strong response in spin down, and vanishing response in spin up. The ratio of their energies is
which is an excellent signal to a classifier, and useful for audio synthesis and as a feature in general.
This "spin assymetry" is a core feature of JTFS. It is FDTS-sensitive asymmetry within the coefficients of a single transform, which is something out of reach for TS. JTFS accomplishes this by breaking TS's tree structure and operating jointly along time and frequency, while TS operates on one vector at a time, agnostic to structures of neighboring features.
To visualize the evolution of JTFS along time, we make a 3D heatmap, with each temporal slice capturing time-frequency intensities across three degrees of freedom: xi1, xi2, and xi1_fr (where xi1 is first-order scattering along time)
We observe the response over time as a smooth morphing from lower to higher xi1, with minimal variation along xi2 and xi1_fr. This is as expected, as variation along xi2 and xi1_fr implies amplitude modulation along time and frequency, respectively, over time, which doesn't happen for an echirp (3). We also observe that the contents of opposing spin occur only for t near start and finish; this is reflective of sharp boundaries, i.e. lack of windowing. Indeed, the energy ratio can otherwise be driven much higher.
Disentangling variability
JTFS linearizes and disentangles core sources of signal variability: carrier frequency
(frequency offset), modulation frequency
(A.M.), and chirp rate γ(F.M.). We reproduce and explain this result from [2]. Denoting the triplet
with
, we have
,which in order, reads window * modulator * carrier, and where
is the carrier's instantaneous frequency. (Its derivative is the argument to the sine of carrier, since the integral of the argument is the frequency.)- γ is in octaves per second, measuring the rate of change of frequency. An "octave" is a doubling in frequency. For
, the frequency doubles over two seconds.
, the (amplitude) modulator's frequency, is constant.
is a Tukey window with effective support of
, where w is number of octaves. It's such that, w octaves are spanned by the effective support. (E.g. if
octaves/second and
octaves, it'll take
seconds to sweep two octaves.)
To make coefficients more comparable across different
, we use a fixed w, making coefficients have the same effective (frequency) span, so the only differences are due to energies and (frequency) offsets of coefficients. We use
. Note, some implementation specifics, and value of w, differ from that of [2]. Visualize a few
: - First, second, third rows: x zoomed, x more zoomed,
zoomed. - x amplitudes (particularly in last column) differ since the x are normalized to have same energy (not shown in the expression for
). - Row 3, plot 1: vertical center is at 512 Hz, consistent with
. The effective duration is about 2 seconds, matching
. We count 8 modulation cycles (including fading ones), consistent with
. - Row 3, plot 2: vertical center is at 1024 Hz, all else same, consistent with this
. - Row 3, plot 3: vertical center is unchanged, slope (chirp rate) is unchanged, duration is unchanged, only modulation count (and rate) changes, consistent with this
. - Row 3, plot 4: vertical center is unchanged; slope is increased, and to satisfy
, duration is decreased. Since duration is decreased, there are fewer modulations in total, but the modulation rate (per unit time) is same.
We see that in each case, the total vertical span (bandwidth) is unchanged.
To compare auditory similarity retrieval in TS and JTFS, we utilize the Isomap algorithm for nonlinear dimensionality reduction, as in [2]. Isomap reduces dimensionality by, in rough terms, grouping points by their "true" distribution: The plots show the "Swiss roll" dataset (left), and its isomap (right), approximately reproduced from the original isomap paper.
- Large black dots = reference points.
- Blue solid line = geodesic path between the points.
- Blue dashed line = Euclidean ("straight line") path between the points.
- Red line = neighborhood graph constructed by Isomap, using n_neighbors = 7.
Quoting the source, "For two arbitrary points [large dots] on a nonlinear manifold, their Euclidean distance in the high-dimensional input space (length of dashed line) may not accurately reflect their intrinsic similarity, as measured by geodesic distance along the low-dimensional manifold (length of solid curve)." Isomap approximates manifold distances by accounting for progression of distances; it connects two points by hopping along points in-between, and minimizing total distance hopped. Isomap (the plot on right) minimizes the overall error of all such connections.
We build Isomaps for TS and JTFS, sweeping
logarithmically with sweep size 16, forming a total of 4096 signals. With
, we obtain the following Isomaps: - A point = transform of an x for a given
. - Blue = low
, red = high
, white = in-between. - Points closer together are transforms of
whose geodesics are closer (see below). That is, closer points = more similar transforms.
For TS, we see blue and red points packed close together. This means that signals with significantly different
have similar time scatterings, which is undesired. Meanwhile, JTFS's isomap is 3-D, matching that of our 3-D parameter space, and points are distributed approximately linearly (in a cube shape) and uniformly, maintaining more even gaps for differences in parameter space. TS reliably maps
and γ (see [2]), forming a 2-D manifold, but fails with
. The manifold being 2-D makes it incapable of mapping distances consistent with the parameter space, which is 3-D. If we nudge
or γ, we nudge along one of the two dimensions in TS's 2-D manifold; if we nudge
, it will also nudge along one of the two dimensions, making it entangled with the other parameters. JTFS disentangles all three parameters. Why this difference? We know from [2] that TS doesn't struggle with
(which lacks frequency resolution), so we hypothesize that higher Q degrades the similarity retrieval of amplitude modulations. What is the dependence, and is there a "tipping point" Q? We investigate by visualizing: TS's 3-D Isomaps gradually collapse into 2-D along the
dimension. While there isn't a strict point at which the collapse is "finalized", the manifold is visually flat past around
. Meanwhile, JTFS retains its locally and globally quasi-linear distribution. To understand why this happens, we fixate at the highest visualized Q of 16:
We observe, that the individual
rows are highly similar in shapes. Following the GIF sweeping
to
, we see this is per progressively degrading time resolution, temporally expanding narrow modulation cycles. At the same time, the energies of columns (sum of squares along time slices) encode substantially different modulation rates. Indeed, while the overall first-order (unaveraged) transform fully encodes
, it does so over multiple coefficients (rows); as TS operates on per-coefficient basis, it is agnostic to adjacent coefficients, and this spread-out encoding is not reflected at the second order. JTFS, on the other hand, operates jointly in time and frequency. Above the Isomap plots, we see the TS's coefficients are nearly same, while JTFS's have definite differences. Lastly, we distinguish coefficient "morphing" behaviors by sweeping parameters in order:
In this case, the transforms perform comparably. To show a difference, the entangled dimensions must be explored. As both transforms are frequency shift equivariant, both perform well with
; indeed, the well-separated "segments" for TS isomaps are due to
. This leaves γ as the entangled dimension. As the transform with an extra degree of freedom and explicit sensitivity for FDTS, JTFS excels at mapping out γ, while TS relies on "surrogate" variability in individual coefficients. As we've seen earlier, JTFS is sensitive to substantial parameter differences, while TS can conflate them. To illustrate the problem, we fix
and γ, for two choices of γ: one where TS works well, and one where it doesn't. And, we sweep
. Let
be the initial
. We observe from coefficient differences, and relative Euclidean distances, that for one of the γ, both JTFS and TS consistently grow away from
. For the other γ, however, while JTFS behaves as before, TS first quickly grows away from
, then grows closer to it, despite
still growing away from
. This shows that TS distances do not closely correspond to parameter distances in
. To show that
and γ are entangled, we consider what happens for a neighboring choice of γ relative to the poorly-performing γ. Suppose this neighbor is well-behaved, and distances always grow with growing
. Since this isn't the case for the reference γ, however, points close to
in the neighbor γ can be about equally close to
in the reference γ. This means that the points cannot be distributed along separate, equally-spaced 1D manifolds. One candidate, with just two γ considered, is that there are two 1D manifolds that share a point (e.g.
) and are angled with respect to it, such that additional γ would form a circle. As differing γ generates a distance (so many γ with same
cannot intersect), this configuration doesn't occur; instead, the 1D manifolds, while separated, are tightly and irregularly packed, and sometimes (approximately) intersect. Hence, the 1D manifolds do not group into a well-behaved extra dimension, and there is entanglement: points with different
or different γ can be close, and incrementing either doesn't consistently increment distances. To visualize these observations, we compare a Isomap slices with our fixed
: TS Isomap is compressed vertically (which, as we'll see, is the
dimension), and points within it are irregularly spaced and irregularly ordered. Meanwhile, JTFS Isomap is roughly uniform both vertically and horizontally. We now show Isomap sub-slices with fixed γs, specifically utilizing the poorly-behaved one from our example, and one of its neighbors:
We see that for different γ, JTFS is distributed along two distinct, roughly equally spaced 1D manifolds, while TS is irregularly spread out in 2D. JTFS is well-ordered, while TS has instances of points such that, farther
are closer than closer
(blue points close to red points). JTFS hence, while preserving frequency resolution, disentangles foundational signal building blocks: frequency offset, frequency modulation, and amplitude modulation.
Disentangling variability (extended)
Placeholder text for explanation
Placeholder text for explanation
Placeholder text for explanation
Time shift insensitivity
Wavelet scattering's signature property, insensitivity to time shifts,
, also carries to JTFS. To illustrate, we - Tukey-window White Gaussian Noise of 4096 samples.
- Shift it by 256 samples.
- Apply TS and JTFS to the original and its shifting, using same T for both. For JTFS, apply it using multiple F, from 0 to 'global'.
- Compute relative distances between transforms of shifted and unshifted signals.
- Repeat 3-4 for all T, from 0 to 'global', swept in powers of 2 (0, 1, 2, 4, 8, ...).
Start with the signal
% make padded Tukey window
win = [zeros(1, N/4), tukeywin(N/2, 0.5).', zeros(1, N/4)];
% make windowed White Gaussian Noise
x1 = circshift(x0, shift);
Make common scattering configurations to be reused in T, F sweeps. As before, except for following parameters, parameter logic is shared with other sections:
- out_3D=true minimizes information losses due to unpadding by unpadding more conservatively, yielding more accurate measurements (4).
- For F=0, we instead set F close to machine epsilon, to avoid internal numeric errors; the difference in outputs is negligible.
F_all = {1e-16, 2, 8, 'global'};
out_exclude = {'S0', 'S1'};
sampling_filters_fr = 'resample';
Q={Q}, r_psi={r_psi}, pad_mode={pad_mode}, ...
max_pad_factor={max_pad_factor}, precision={precision});
out_exclude={out_exclude}, max_pad_factor_fr={max_pad_factor_fr}, ...
pad_mode_fr={pad_mode_fr}, sampling_filters_fr={sampling_filters_fr}, ...
out_3D={out_3D}, average_fr={average_fr});
Initialize loop variables
rl2_ts_all = dictionary();
rl2_jtfs_all = dictionary();
rl2_jtfs_all{F(1)} = dictionary();
% dyadic scale of `N`, for sweeping
Iterate
% Internally, `T=2^log2_N` acts same as `T='global'` (in general, it's
% rather if `T=2^ceil(log2(N))`, but here that's same as our `log2_N`). See
% `T` docs for an explanation of (and caveats to) this design choice.
ckw{'oversampling'} = 99;
ts = Scattering1D(N, kwargs=ckw);
ts_x0 = ts_x0(:, 2:end, :);
ts_x1 = ts_x1(:, 2:end, :);
rl2_ts_all(log2_T) = gather(rel_l2(ts_x0, ts_x1));
jkw = update_dictionary(jkw, ckw);
jtfs = TimeFrequencyScattering1D(N, kwargs=jkw);
% select joint pairs (in `{2}` per `out_3D=true` with `out_type='array'`)
rl2_jtfs_all{{F}}(log2_T) = gather(rel_l2(jtfs_x0, jtfs_x1));
Make and save plots
plot_T_sweep(rl2_ts_all, rl2_jtfs_all, F_all, saveDir)
Print distances
fprintf(['Relative L2 for T=''global'':\n' ...
'%.2e -- JTFS, F=%d\n' ...
'%.2e -- JTFS, F=%d\n' ...
'%.2e -- JTFS, F=%d\n' ...
'%.2e -- JTFS, F=%s\n'], ...
rl2_jtfs_all{{F1}}(log2_N), 0, ...
rl2_jtfs_all{{F2}}(log2_N), F2, ...
rl2_jtfs_all{{F3}}(log2_N), F3, ...
rl2_jtfs_all{{F4}}(log2_N), "'global'")
Relative L2 for T='global':
1.30e-16 -- TS
5.68e-16 -- JTFS, F=0
5.75e-16 -- JTFS, F=2
7.89e-16 -- JTFS, F=8
6.48e-17 -- JTFS, F='global'
Result:
A clear and monotonic dependence is observed for distance versus T, and versus F. For T='global', where we expect invariance, i.e. zero distance, we obtain (repeated for convenience)
which is within float precision of zero. Note how, unlike with F='global', invariance with time shifts is faithfully reproduced, as unaveraged CWT (and by extension scattering) is exactly equivariant to time shifts, but not to frequency shifts.
For JTFS, we observe that higher F yields lower time-shift distances. Indeed, frequency averaging also has a time averaging effect, which is converse of the result shown in [9] that time averaging has a frequency averaging effect. In simple terms, - Recall the echirp
earlier, and imagine shifting it by a small amount horizontally (i.e. in time), until it no longer overlaps itself. - Now imagine lowpass filtering it vertically (i.e. along frequency). The scalogram is blurred, or "widened", vertically, and this additional support makes it overlap with its shift again, lowering distance.
- With global averaging, F='global', where the scalogram completely stretched out vertically (and can be collapsed to 1D, shaped [1 N]), this effect is maximized, and most of the scalogram overlaps its shifted self.
We also observe that for some T, the JTFS distances with F=0 slightly exceed that of TS. This is a legitimate effect and not a numeric artifact. In short, this isn't a concern in practice and insensitivity works as expected; refer to (5) for further discussion. Warp-stability, noise-robustness
We state known properties of the scattering transform, without illustrating:
- Warp-stability: a time warp is a deformation of form
, where
(
implies time stopping, and
time reversal;
is a pure shift and is included). TS is proven to be stable for up to
in [8] (also see (6)). "Stable" in context means, in simple terms, that the distance between transforms of
and
is bound by the "size" of deformation, not characteristics of x. - Noise-robustness: TS is stable against additive noise. This follows its contractiveness property:
, where S is scattering [9]. If
, it says that distance between scattering of x and noisy x cannot exceed the distance between x and noisy x.
JTFS inherits noise robustness, since it inherits contractiveness. Contractiveness is inherited since it is a direct consequence of energy non-expansiveness of wavelet filterbanks (by design), and of the complex modulus operator [9]. JTFS also inherits warp-stability, but it's not identical to TS's. That is, following the earlier observation in the "Frequency shift insensitivty" section on distances with F=0, JTFS distances aren't required to follow TS distances, and can exceed them. Warp stability proofs in [8] are for TS only, and while distance behaviors are expected to agree closely in general, the exact relationship is unknown (to best of author's knowledge). Applications of JTFS properties
- Audio classification: strong performance was shown in a wide variety of tasks [2] [3] [4] [5].
- Biomedical classification: strong performance was shown in [10], but existing research is limited.
- Audio synthesis: advantages over TS for audio texture synthesis, in better preserving alignment of partials upon high T, are shown in [2], [3]. [6] showed how spinned coefficients can be used to locally time-reverse a signal. [2], [5] showed that JTFS accurately models auditory similarities, and [7] additionally showed how JTFS, unlike the spectrogram or scalogram, is viable for mesoscale parameter retrieval.
Properties utilized:
- FDTS discriminability: taking an audio signal with many transients, such as a jackhammer in a street scene, and misaligning its subbands yields a completely different sound [3]. Seismic signals can exhibit frequency-dependent dispersion characteristics, which can be used to assess changes in the structural integrity of a medium, or to analyze the Earth's structure. Different mechanical faults (e.g. bearing wear, gear damage, shaft misalignment) alter a system's stiffness and damping, causing FDTS in vibration signals; FDTS insensitivity would hinder fault diagnosis.
- Frequency-shift insensitivity: speaker identification is sensitive to pitch, but speech recognition in non-tonal languages isn't [3]. Same instruments can be played at different pitches, insensitivity hence aids musical instrument recognition. As ECG can vary by a person's baseline heartrate, insensitivity aids with recognizing arrhythmia patterns across patients.
- FDTS sensitivity: crucial to JTFS's generality is that "FDTS" need not be some explicit transformation. JTFS doesn't know, nor cares, whether this occurred. Only variation in time-frequency is needed, which is true of anything that's not a pure sine. The utility of JTFS lies in preserving frequency modulation information against time averaging using much fewer scattering orders than TS. While both can discriminate a unit impulse from an exponential chirp, TS requires many more orders and coefficients than JTFS to do so with a high T. Another utility is in having a richer decomposition: the added degree of freedom via frequential scattering introduces "quefrency", which by itself separates F.M. and A.M. more granularly than TS by shape and size of variation, but also together with second-order "frequency", adds the "slope" descriptor, decomposing F.M.s by their rates of change. While explicit FDTS phenomena are limited, frequency modulations are nearly universal.
- Time-shift insensitivity, time-warp stability, noise-robustness: foundational scattering properties which JTFS inherits. In synthesis, shift-insensitivity benefits parameter retrieval (e.g. [7]), as the input recording can be arbitrarily shifted without affecting parameters of interest. In classification, classes are ubiquitously invariant to time shifts; delayed trumpet or tachycardia are still trumpet and tachycardia. General or further discussion of benefits of shift-insensitivity, or of warp-stability or noise-robustness can be found in scattering literature (e.g. [9]).
JTFS vs TS: when to use which?
Synthesis: it's matter of what signal parameters one wishes to control:
- JTFS provides more "knobs" to turn (kinds of coefficient manipulations to use), enabling finer control. That is, unless, we pursue third- or higher-order TS, where the "knobs" between the transforms differ. However, JTFS can simply be paired with higher orders of TS.
- JTFS's preserving of FDTS sensitivity through increased time-shift insensitivity can also be beneficial, where said insensitivity is needed.
- TS is undoubtedly faster, and can be sufficient if JTFS's added properties aren't relevant.
Classification: the mainstream advice is, use TS when
- FDTS not relevant: "... this increased discriminative power may not always be desirable. For example, frequency-dependent time-shifts ... or similar transformations may not be relevant for the classification task. In this case, replacing the time scattering transform with a joint time-frequency scattering transform would needlessly increase the number of model parameters, potentially requiring more training data to train an accurate classifier." [3]
- T is small: "For phone identification, we therefore require the invariance scale T to be of this order. Since T is small, there is less room for the type of misalignment seen in Section III-A. We therefore expect the joint time-frequency scattering to provide only limited improvement over time scattering." [3]
We note, however, that the amount of published work involving JTFS is limited, especially that which analyzes its classification success. Without publishing, the original author of wavemat and this article has studied the matter in depth, and advises the following:
- JTFS beats TS, almost always. For numerous reasons beyond the scope of this example, JTFS can be considered as strictly superior to TS for classification. A part of reasoning concerns the description in previous section on "FDTS sensitivity". If a real-world task requires an automated classifier, odds are extremely low that it's comprised mainly of stationary signals without frequency varying over time. Moreover, JTFS can be paired with higher-order TS, if loss of information due to high T is a concern. Size of transform (hence dimensionality) is controllable, but the idea is, it's a price worth paying. With absence of extensive ablation studies, "almost always" is the author's bet.
- Compute-efficiency: despite having superior performance, JTFS's performance-to-compute ratio can be worse. A priori determination of said ratio isn't practical. However, with "amount of time-varying frequency" (relative to stationary contents) mattering most, one can define a heuristic to measure it, and iterate over the dataset. While outside the scope of this example, we provide a summary in (7).
- T is small: if small T is preferred for whatever reason, then TS is likelier to outperform JTFS, as latter is overly redundant and high-dimensional (in agreement with "mainstream"). Compensating for reduncancy with large F isn't generally safe, as the frequency axis holds a lot of important information. However, it should be rare that for reasons other than compute efficiency, a small T is preferred over JTFS + higher order TS with high T, even if task-relevant structures are narrowly localized in time.
- Simplicity: designing an appropriate learning algorithm to go on top of JTFS (e.g. convolutional networks) is more challenging than for TS. Also, TS generally involves less paramter tuning. For appropriate tasks, JTFS may nonetheless work better out-out-box or require less tuning: the parameters that matter most are T, Q, and F if average_fr=true. Worth trying are Q=8,16,24 and T=N/8,N/32 as starters.
Extended notes
(1) Strictly speaking, it can, but it would take a great number of orders, more than in virtually any application. TS is only invariant to two kinds of transformations: a global phase shift, and a global time shift if T='global'. Yet, at the same time, due to its tree structure, TS cannot distinguish between
and any arbitrary global time-shifting of individual its frequencies λ, including if said shifts are different. What keeps TS discriminative is the fact that such transformations are impossible: there aren't x that yield such X. This includes the exponential chirp formed from a unit impulse: such a chirp would need to be correlated with each frequency λ at only one time instant, which is impossible. The power of JTFS lies in achieving this discriminability with only two orders. (2) The idea is energy conservation, which in turn ensures that features aren't over- (or under-) represented. F<2^J_fr, for example, would raise redundancy (create information overlap) of psi_t * phi_f and phi_t * phi_f pairs relative to unaveraged spinned pairs, and their energies would exceed conserving amounts. T<2^J has this effect in terms of orders, making higher orders recover more information than was lost in previous. Since our max_order<inf, and spinned can be averaged, both can be beneficial, but not for accurately representing each order (as a decomposition; and, in this case, we require average_fr=false).
(3) Strictly speaking, it is variation in time and log-frequency slices of
, over time, which does happen, due to the wavelets varying in time support and bandwidth as a function of xi1. This variation is small for small sweeps across xi1. (4) In "idealized" sense, which matters for direct comparisons against TS, as TS doesn't incur unpad losses along frequency. Depending on context, this can manifest as overestimates or underestimates of a given metric. In this case, the difference against out_3D=false is barely perceptible, but we use true as good practice. In other sections, this was deemed less important, or more code-inconvenient (out_3D=true requires average_fr=true, and changes output handling syntax with out_type='array'). Note, there's also time unpad losses; in short, all results (in all sections) remain faithful to intent, but for details, see (8). (5) Spinned coefficients can significantly increase or decrease time-shift sensitivity relative to time scattering, depending on input signal. A decrease occurs per spatial smoothing along frequency axis, due to frequential scattering. An increase occurs per the scalogram being effectively "split up" between the two spins, lowering per-spin coefficient overlap upon a time shift, and driving up distance. The increase can persist, and remain significant, even with F='global' in extreme (and impractical) cases. In practice, we expect sensitivity to behave approximately as in our plot, especially for high T. (Note: due to time unpad losses (see (4)), the gap between F=0 and TS in the plot is exaggerated, though said gap is still small; also see (9).) (6) While [9] states the result for
, [8] proves it only for
. Note, "size" of deformation includes
and
, respectively, the gradient of deformation, and size of deformation (with J dependence). [8] states that
can be replaced with
if the additional dependence of
is introduced to the bounding distance. Yet, as
, this term explodes, making the bound non-existent. For
not too close to 1, we still see some approximate stability, but the scattering distance bound is no longer linearly proportional to size of deformation. (7) Ratio of spinned coefficients' energy to non-spinned joint energy ("joint" = non-S0, S1), and, relative Euclidean distance between unaveraged (in time and frequency) spin up and down coefficients, should both be sufficiently large. So, E_spinned / E_nonspinned, and rel_l2(up, dn). From limited experimentation, "sufficiently large" are, respectively, 0.1 and 0.5. For the first metric, T = 2^J and F = 2^J_fr must be used (and default J and J_fr work well), and Q, r_psi should match what will actually be used.
(8) Any T='global' isn't affected at all, as there isn't unpadding. The frequency transposition example is negligibly affected. In the time-shift insensitivity example, the difference in plots is perceptible, but small. (The explanation for the two aforementioned cases concerns effective signal support given windowing, hence the amount of unpad losses, and whether time shifting is involved, which also affects losses.) The echirp example's ratio does change substantially (in testing, from
to
), which is as expected as the boundary effect is exacerbated (more of it is kept), but a more realistic chirp example (and incidentally, more idealized) involves windowing, which'd greatly reduce this effect; windowing was omitted to illustrate this effect. (9) The precise curves expected for F=0 vs TS for White Gaussian Noise weren't studied, but by reimplementing JTFS to include more scattering paths, observed were much fewer instances of F=0 distances exceeding TS's, and a greater (though still small) gap of TS exceeding F=0 for small T. This was only tested for the given shift, Q, and r_psi. Note that these differences against what's obtained in this example aren't practically significant, they're only for if we wish to be very precise (or to know if F=0 exceeding TS always holds for WGN).
References
[3] J. Andén, V. Lostanlen, S. Mallat. "Joint Time-Frequency Scattering", in IEEE Transactions on Signal Processing, vol. 67, no. 14, pp. 3704-3718, 2019. doi: 10.1109/TSP.2019.2918992. https://ieeexplore.ieee.org/document/8721532 [4] J. Andén, V. Lostanlen, S. Mallat. "Joint time-frequency scattering for audio classification", in IEEE 25th International Workshop on Machine Learning for Signal Processing, 2015. doi: 10.1109/MLSP.2015.7324385. https://ieeexplore.ieee.org/document/7324385 [5] V. Lostanlen, C. El-Hajj, M. Rossignol, G. Lafay, J. Andén, M. Lagrange. "Time–frequency scattering accurately models auditory similarities between instrumental playing techniques", in EURASIP J Audio Speech Music Process, 2021. doi: 10.1186/s13636-020-00187-z. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7801324/ [6] V. Lostanlen, F. Hecker. "The Shape of RemiXXXes to Come: Audio Texture Synthesis with Time-frequency Scattering", in arXiv, 2019. doi: 10.48550/arXiv.1906.09334. https://arxiv.org/abs/1906.09334 [7] C. Vahidi, H. Han, C. Wang, M. Lagrange, G. Fazekas, V. Lostanlen. "Mesostructures: Beyond Spectrogram Loss in Differentiable Time-Frequency Analysis". Journal of the Audio Engineering Society, 2023, 71 (9), pp.577-585. hal-04118474. https://hal.science/hal-04118474/document [10] J. Muradeli. "Sleep Stage Classification using Joint Time-Frequency Scattering", 2024. MathWorks URL